Detecting Malapropisms Using Measures of Contextual Fitness
Author
Abstract
While detecting simple language errors (e.g. misspellings or number agreement) is nowadays standard functionality in all but the simplest text editors, more complicated language errors may go unnoticed. A particularly difficult case is an error disguised as a valid word that fits syntactically into the sentence. We use the Wikipedia revision history to extract a dataset of such errors in their contexts. We show that the new dataset provides a more realistic picture of the performance of contextual fitness measures. The achieved error detection quality is generally sufficient for competent language users who are willing to accept a certain level of false alarms, but may be problematic for non-native writers who accept all suggestions made by the systems. We make the full experimental framework publicly available, which will allow other researchers to reproduce our experiments and to conduct follow-up experiments.
RÉSUMÉ. While the detection of simple errors is now a standard feature of reasonably advanced word processors, many errors remain hard to spot. This is often the case when the correct form is replaced by another form that is valid and syntactically plausible in context. We used Wikipedia revisions to automatically extract a list of errors of this type. These data give a better idea of the real usefulness of standard measures of contextual fitness, whether linguistic or statistical. The detection rates obtained are generally sufficient for competent writers who are willing to accept a certain level of false alarms; they remain problematic for non-native writers. The complete experimental setup used for this work will be made public, allowing other researchers to reproduce our experiments and extend our results.
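As a rough illustration of what a statistical measure of contextual fitness can look like, the sketch below scores a word by its smoothed bigram probabilities with its left and right neighbours, and raises an alarm when a member of the word's confusion set fits the context clearly better. It is not the method evaluated in the paper: the toy corpus, the confusion set, the `threshold` value, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

# Toy corpus and confusion set: illustrative assumptions, not data from the paper.
CORPUS = [
    "they travel around the world every year",
    "they sailed around the world",
    "the world is large",
    "she wrote a word on the board",
    "he did not say a word",
    "people around the world read the news",
]
CONFUSION_SETS = {"word": {"world"}, "world": {"word"}}


def train_bigrams(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


UNIGRAMS, BIGRAMS = train_bigrams(CORPUS)


def fitness(tokens, i, candidate):
    """Log-probability of `candidate` at position i, given both neighbours.

    Uses add-one smoothing so unseen bigrams get a small non-zero probability.
    """
    padded = ["<s>"] + tokens + ["</s>"]
    left, right = padded[i], padded[i + 2]
    vocab = len(UNIGRAMS)
    p_left = (BIGRAMS[(left, candidate)] + 1) / (UNIGRAMS[left] + vocab)
    p_right = (BIGRAMS[(candidate, right)] + 1) / (UNIGRAMS[candidate] + vocab)
    return math.log(p_left) + math.log(p_right)


def detect(sentence, threshold=1.0):
    """Return (position, observed word, better-fitting alternative) alarms."""
    tokens = sentence.split()
    alarms = []
    for i, token in enumerate(tokens):
        observed = fitness(tokens, i, token)
        for alternative in sorted(CONFUSION_SETS.get(token, ())):
            if fitness(tokens, i, alternative) - observed > threshold:
                alarms.append((i, token, alternative))
    return alarms


if __name__ == "__main__":
    print(detect("they travel around the word"))
    print(detect("he did not say a word"))
```

The threshold controls the trade-off the abstract describes: lowering it catches more errors but raises more false alarms, which matters most for writers who accept every suggestion.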
Similar resources
Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History
We evaluate measures of contextual fitness on the task of detecting real-word spelling errors. For that purpose, we extract naturally occurring errors and their contexts from the Wikipedia revision history. We show that such natural errors are better suited for evaluation than the previously used artificially created errors. In particular, the precision of statistical methods has been largely o...
Detection and Correction of Malapropisms in Spanish by Means of Internet Search
Malapropisms are real-word errors that lead to syntactically correct but semantically implausible text. We report an experiment on detection and correction of Spanish malapropisms. Malapropos words semantically destroy collocations (syntactically connected word pairs) they are in. Thus we detect possible malapropisms as words that do not form semantically plausible collocations with neighboring...
Experiments in Detection and Correction of Russian Malapropisms by Means of the Web
Malapropism is a semantic error that is hardly detectable because it usually retains syntactical links between words in the sentence but replaces one content word by a similar word with quite different meaning. A method of automatic detection of malapropisms is described, based on Web statistics and a specially defined Semantic Compatibility Index (SCI). For correction of the detected errors, s...
Malapropisms Detection and Correction using a Paronyms Dictionary, a Search Engine and Wordnet
This paper presents a method for the automatic detection and correction of malapropism errors found in documents using the WordNet lexical database, a search engine (Google) and a paronyms dictionary. The malapropisms detection is based on the evaluation of the cohesion of the local context using the search engine, while the correction is done using the whole text cohesion evaluated in terms of...
On Detection of Malapropisms by Multistage Collocation Testing
Malapropism is a (real-word) error in a text consisting in unintended replacement of one content word by another existing content word similar in sound but semantically incompatible with the context and thus destructing text cohesion, e.g.: they travel around the word. We present an algorithm of malapropism detection and correction based on evaluating the cohesion. As a measure of semantic comp...
Journal title: TAL
Volume: 53, Issue:
Pages: -
Publication date: 2012